Y. Xia, J. Sun Bioinformatic and Statistical Analysis of Microbiome Data https://doi.org/10.1007/978-3-031-21391-5_18

18. Multivariate Longitudinal Microbiome Models

Yinglin Xia¹ and Jun Sun ¹

(1)

Department of Medicine, University of Illinois Chicago, Chicago, IL, USA

Abstract

Chapter 15 mainly introduced the glmmTMB, GLMMadaptive, and FZINBMM packages. However, all these three packages can only analyze univariate longitudinal microbiome data. This chapter first provides an overview of multivariate longitudinal microbiome analysis. Then it introduces the non-parametric microbial interdependence test (NMIT) and illustrates its implementation using R and QIIME 2. The large P small N problem is also discussed in this chapter.

Keywords

Multivariate longitudinal microbiome analysis Multivariate distance/kernel-based Longitudinal models Multi-omics methods Univariate analysis Multivariate analysis Non-parametric microbial interdependence test (NMIT) Correlated sequence kernel association test (cSKAT) Correlation matrix Distance matrix Large P small N problem GLMM-MiRKAT aGLMM-MiRKAT Frobenius norm qiime longitudinal nmit Kendall correlation method Pearson correlation method Spearman correlation method p-corr method

In Chap. 15, we focused on linear mixed-effects models (LMMs), one of most widely univariate longitudinal model in classical statistical literature and has recently been applied into microbiome data analysis. In Chap. 16, we comprehensively introduced the estimation and modeling in the generalized linear mixed models (GLMMs), and in Chap. 17 we introduced and illustrated how to use GLMMs for modeling microbiome data via the glmmTMB, GLMMadaptive, and FZINBMM packages. However, all these three packages can only analyze univariate longitudinal microbiome data. In this chapter, we first provide an overview of multivariate longitudinal microbiome analysis (Sect. 18.1). We then introduce non-parametric microbial interdependence test (NMIT), one of newly developed multivariate longitudinal microbiome analysis methods, and illustrate how to implement NMIT using R and QIIME 2 (Sect. 18.2). In Sect. 18.3, we discuss the large P small N problem. Finally, we complete this chapter with a brief summary in Sect. 18.4.

18.1 Overview of Multivariate Longitudinal Microbiome Analysis

Currently longitudinal models for discovering dynamic nature of the multivariate microbiome data are still rare. In this section, we review some newly developed multivariate longitudinal methods/approaches for modeling dynamic microbiomes.

18.1.1 Multivariate Distance/Kernel-Based Longitudinal Models

Recently the kernel machine regression framework has been extended to test the association between the outcomes and the microbiome community in longitudinal setting. Below, we introduce some longitudinal multivariate distance/kernel-based association tests of microbiome data.

Correlated sequence kernel association test (cSKAT) (Zhan et al. 2018) was proposed to directly test microbiome association with related outcomes either from longitudinal and family studies using the linear mixed-effects models (LMMs). It employs (1) random effects to account for the outcome correlations and the effect of covariates and (2) a small-sample adjusted kernel variance component score test called cSKAT to account for high dimensionality, address the problem of small samples but large number of association tests. cSKAT is flexible, allowing both longitudinal and clustered data to be analyzed. However, cSKAT is limited to a continuous outcome and to the item by-item use of the ecological distances (Koh et al. 2019). Because its test may be conservative, it is not a perfect exact test (Zhan et al. 2018). Actually, cSKAT is still not a multivariate longitudinal method because it is limited to a continuous outcome but cannot model multiple taxa (multiple outcomes) simultaneously.

GLMM-MiRKAT and aGLMM-MiRKAT (Koh et al. 2019) are a distance-based kernel association test and its data-driven adaptive test based on the generalized linear mixed model (GLMM) (Breslow and Clayton 1993). They were built on two statistical frameworks: (1) GLMM and (2) ecological distance/dissimilarity for microbial community analysis to model data dependency due to clusters (repeated measures) and to account for the within cluster correlation in responses. Because GLMM-MiRKAT uses the framework of GLMM, it not only can model Gaussian distributed host outcomes but also can handle non-Gaussian host outcomes such as binomial and Poisson regression models. In the sense of generalization, GLMM-MiRKAT is an extension of cSKAT (Zhan et al. 2018). Thus, GLMM-MiRKAT is able to analyze the association between microbial composition and various host outcomes adjusting for covariates. By using framework of ecological distances, e.g., Jaccard (Jaccard 1912) /Bray-Curtis dissimilarity (Bray and Curtis 1957), unique fraction distance, GLMM-MiRKAT can handle the multiple features or taxa in a multivariate approach. Furthermore, GLMM-MiRKAT uses the kernel trick (Cristianini and Shawe-Taylor 2000) and performs a variance component test (Lin 1997) to deal with the large P and small N problem due to high dimensionality. The aGLMM-MiRKAT adapts the test statistic of the minimum P-value from multiple item-by-item GLMM-MiRKAT analyses so that it avoids the need to choose the optimal distance measures from various abundance-based and phylogeny-based distances. The aGLMM-MiRKAT is useful to detect diverse types of host outcomes with robust power and valid statistical inference (Koh et al. 2019). However, one drawback of aGLMM-MiRKAT is just due to its adaption approach: in practice, adaption of a minimum P-value ignores the natures of test statistics and the distance measures. GLMM-MiRKAT and aGLMM-MiRKAT ignore compositional nature of taxa and treat taxa or OTUs as independent. Additionally, the practice of its data processing is also arbitrary such as excluding measurements with low sequencing depth (i.e., <10,000 total reads) and removing OTUs with average relative abundance <10⁻⁵.

18.1.2 Multivariate Integration of Multi-omics Methods

Lê Cao and her team (Bodein et al. 2019) propose a generic multivariate framework to integrate microbiome data with multi-omics datasets for longitudinal studies. This generic data-driven framework consists of data pre-processing, modeling, data clustering, and integration. The framework was mainly built on two statistical techniques: (1) a linear mixed model (LMM) including smoothing splines to model profiles across groups of samples and (2) sparse multivariate ordination methods to identify sets of variables that are highly associated across the data types and across time. Due to using linear mixed model splines (LMMS) with multivariate dimension reduction techniques, the proposed framework (Bodein et al. 2019) (1) is able to reduce the data dimension and to account for the individual variability and hence can identify the main patterns of longitudinal variation and (2) is also able to analyze data at different time points. However, it also have several limitations (Bodein et al. 2019), including (1) in this framework the data is actually fitted with simple linear regression models because a high individual variability between biological replicates limits the LMMS modeling step. (2) A large number of time points can result in noisy profiles and clusters are modeled which is often due to high individual variability. (3) The overall performance would be optimal for regularly spaced time points in the longitudinal omics experiments, although it was shown that clustering seems to not be impacted by the LMMS interpolation of missing time points. And (4) the issue of analyzing time-course compositional data have not been fully addressed.

The limitations of multivariate distance/kernel-based longitudinal models and multivariate integration of multi-omics methods really highlight the challenges in multivariate longitudinal microbiome data: (1) how to analyze multiple taxa simultaneously; (2) how to collect and integrate multiple omics data at different time points; (3) how to model multiple taxa as the response variables; and (4) how to deal with the large P small N problem (see Sect. 18.3 for details).

18.1.3 Univariate Analysis Versus Multivariate Analysis

Different from the univariate approach of microbiome data analysis, which analyzes each single microbial taxon separately or one-by-one, multivariate analysis analyzes multiple microbial taxa simultaneously. However, when there are too many microbial taxa of interest, it is difficult to directly analyze them simultaneously. In such cases, in general, two strategies could be taken: (1) reduce dimensions directly through dimension reduction techniques such as clustering and ordinations (e.g., PCA or PCoA, NMDS) and (2) use summary statistics such as distance/dissimilarity measures (e.g., Bray-Curtis dissimilarity) and/or correlation analysis (e.g., Kendall, Pearson, Spearman). For the details of using the approach of dimension reduction techniques, the readers are referred to Chap. 10. For the approach of using summary statistics, please see the following NMIT analysis.

In either way, the statistical hypothesis testing will be performed on the measured values, the summary statistics, or the reduced spaces (e.g., principal components or clusters) rather than directly on taxa (e.g., genes or species).

Both univariate and multivariate approaches have their own advantages and disadvantages. Multivariate methods can analyze all taxa simultaneously and deal with the simultaneous relationship among taxa. In general, univariate methods use the mean or median and the variance of a single taxon, while multivariate methods use covariances or correlations of multiple taxa. Microbiome data are multivariate. Thus, multivariate analysis methods are more appropriate to deal with a significant amount of collinearity among taxa in the data matrix.

18.2 Nonparametric Microbial Interdependence Test (NMIT)

Multivariate distance-based framework can be extended to test the association between the outcomes and the microbial community in longitudinal setting. NMIT is one of newly developed multivariate longitudinal distance-based method.

18.2.1 Introduction to NMIT

NMIT (Zhang et al. 2017) is a multivariate distance-based nonparametric longitudinal model and provides a nonparametric group comparison for microbial interdependence to adequately capture the dynamic nature of the microbiome data.

NMIT aims to identify differences in microbial interdependent relationships between groups over time. NMIT is used to compare temporal microbial interdependence structures between groups and test its association with covariates which are either binary outcome (e.g., disease status, gender, or case-control group indicator) or disease-associated continuous environmental factors or quantitative variables (e.g., age, BMI, blood pressure or biomarker measurement).

The NMIT framework consists of two major steps of statistical analysis or testing, one step of network analysis and one step of heatmap for result visualization. First, the core part of NMIT is to perform pairwise correlation analysis within each subject using the longitudinal microbial measurements to capture the individual microbial dependencies over time. Second, perform permutation MANOVA (Anderson 2001; McArdle and Anderson 2001; Tang et al. 2016) to test whether the correlation structure is different between groups or associated with an interested outcome or not. Third, use network analysis to visualize microbial dependency in different groups, and finally, use heatmap to visualize the results of the differences of temporal correlation structure between groups.

18.2.1.1 Calculate Interdependence Correlation Matrix

The goal of NMIT is to test the association between the temporal microbial interdependence profiles and the covariates of interest. NMIT assumes that the relative abundance of subjects for taxa over time points have been collected. Let Y_ij = (y_ij1, …, y_ijm)^T denote the relative abundances of taxa M for the ith subject at time J, and let X_i = (x_i1, …, x_ip) denote P covariates collected for subject i.

Under this temporal microbial interdependence setting, NMIT first calculates the M × M pair-wised taxa correlation matrix to capture the interdependence profile for each subject. Let u_{i; m,m′} be the temporal correlation between taxa m and m′, and then let ${U}_i={\left\{{u}_{i;m,{m}^{\prime }}\right\}}_{m,{m}^{\prime }=1}^M$ denote the M × M temporal correlation matrix of subject i. The core part of the NMIT is to construct the interdependence correlation matrix for each subject to summarize their microbial interdependent correlation structures using the repeated microbiome measurements. NMIT has evaluated three correlation methods: (1) Pearson correlation for the linear association between two random variables; (2) the non-parametric Kendall’s rank correlation for the nonlinear association among microbial species; and (3) the maximal information coefficient (MIC) (Reshef et al. 2011), measuring how much information two random variables share to capture both linear and nonlinear associations between two continuous variables.

18.2.1.2 Calculate Distance Matrix

NMIT then calculates the distance matrix $D={\left\{{d}_{i{i}^{\prime }}\right\}}_{i,{i}^{\prime }=1}^N$ , which measures the temporal correlation distance between subjects i and i′. NMIT has considered many distance metrics to calculate the distance of two correlation matrices. For example, one commonly used distance is Frobenius norm which is defined as:

${d}_{i{i}^{\prime }}={M}^{-1}\sqrt{\sum_{m=1}^M{\sum}_{m^{\prime}}^M{\left({u}_{i;m,{m}^{\prime }}-{u}_{i^{\prime };m,{m}^{\prime }}\right)}^2}.$

(18.1)

18.2.1.3 Perform Statistical Hypothesis Testing of Distance Matrix

Next, with any predetermined distance metric, NMIT performs permutation MANOVA to compare whether the distances between subjects within groups are shorter than the distances between subjects across groups to infer the group difference in terms of microbial interdependence profile. Specifically let X = (X₁, …, X_N)^T be the N × P covariate matrix, then the projection matrix is H = X(X^TX)⁻¹X^T and the centroid matrix is G = (I – 11^T/n)A(1 – 11^T/n), in which A is the adjacent matrix of D defined as $A=\left\{{a}_{i{i}^{\prime }}\right\}=\left\{-{d}_{i{i}^{\prime}}^2/2\right\}$ , i = 1, …, N, I is an N × N identity matrix, and 1 is an N by 1 vector with all elements being 1. The pseudo-F statistics can be constructed as:

$F=\frac{tr(HGH)}{tr\left[\left(I-H\right)G\left(I-H\right)\right]},$

(18.2)

where tr(∙) is the trace of a matrix.

Statistical significance testing the null distribution of the pseudo-F statistics is difficult. Typically permutation strategy is used to assess the statistical significance. In the literatures of metagenomics and genetics, permutation test has been widely adopted (Reiss et al. 2010), such as to analyze single time microbial location differences. Like in other studies of metagenomics and genetics, NMIT uses permutation test to perform statistical significance testing of distance matrix.

18.2.2 Implement NMIT Using R

The syntax of NMIT is given as:

NMIT(otu, id.var, cov.var, time.var, method = "kendall", dist.type = "F", heatmap = T, classify = F, fill.na = 0)

where the argument:

otu is a matrix or data frame of OTU table.
id.var is a vector of subjects.
cov.var is a vector of covariates.
time.var is a vector of time variable.
method is used to specify the correlation method (“pearson”, “kendall”,”spearman”). The default method is “kendall”.
dist.type is a character string, which is used to specify the type of matrix norm to be computed. The default is “F”or “f” which specifies the Frobenius norm (the Euclidean norm of x treated as if it were a vector). Specifying M″ or “m” is for the maximum modulus of all the elements in x. Specifying “O”, “o” or “1” is for the one norm (maximum absolute column sum) and “I” or “i” is for the infinity norm (maximum absolute row sum).
heatmap is a logical value indicating whether to draw heatmap, by default is TRUE.
classify is a logical value indicating whether to draw classifier tree, by default is FALSE.
fill.na is a number between 0 and 1 to fill the missing value; by default the missing value is filled with 0.

After implementing NMIT ()function, it returns a typical, but limited, output for analysis of variance (general linear models), including:

aov.tab is a typical AOV table showing sources of variation, degrees of freedom, sequential sums of squares, mean squares, F statistics, partial R-squared, and P-values, based on N permutations.
coefficients is a matrix of coefficients of the linear model, with rows representing sources of variation and columns representing taxa (i.e., OTU, genus, species); each column represents a fit of a taxon abundance to the linear model. These are what we typically obtain when we fit one taxon to the predictors. These are only available when the sample by taxa (i.e., site x species) matrix is provided, while they are NOT available when the distance matrix in the formula is supplied.
coef.sites is a matrix of coefficients of the linear model, with rows representing sources of variation and columns representing samples or sites; each column represents a fit of sample or site distances (from all other sites) to the linear model. These are what we typically obtain when we fit distances of one site to the predictors.
f.perms is an N by m matrix of the null F statistics for each source of variation based on N permutations of the data. The permutations can be inspected with permustats and its support functions.
model.matrix is the model.matrix for the right hand side of the formula.
terms is the terms component of the model.

Example 18.1: Longitudinal Pregnant Women’s Vaginal Microbiota Study

Romero et al. (2014) conducted a longitudinal study using 16S rRNA sequencing to compare the different vaginal microbiota for the pregnant women who had a term delivery and those women who had a preterm delivery. The study collected 72 pregnant women who had a term delivery and 18 women who had a spontaneous preterm delivery. In this study, the interested variable is the binary case-control indicator (the group variable) for term versus preterm delivery.

The original study concluded that the healthy pregnant women had a higher abundance of vaginal microbiota Lactobacillus spp. than nonpregnant women. Here, we use this data to illustrate NMIT method to examine whether the pregnant women present a different microbial interdependence profile from those non-pregnant women.

First, we set up working directory and install the NMIT package (latest version 0.1, 2016-11-19).

> setwd("~/Documents/QIIME2R/Ch18_NMIT")

> library(devtools)

> install_github("elong0527/NMIT")

> library(NMIT)

Then we can use the functions NMIT() or NMIT_phyloseq() to perform the Nonparametric Microbial Interdependence Test (NMIT). By using NMIT(), we need a matrix of OTU table and sample/meta dataset as input data, while by using NMIT_phyloseq(), the input data need to be the phyloseq data structure (phyloseq object). Here we only illustrate the NMIT() function.

We can take five main steps to perform the NMIT() function.

Step 1: Load OTU table and sample dataset.

> otu_tab <- read.csv("RomeroOTU.csv", row.names = 1)

> meta_tab <- read.csv("RomeroSampleData.csv",row.names = 1

> head(otu_tab,3)

Lactobacillus.iners Lactobacillus.crispatus Atopobium.vaginae Lactobacillus

1 1050 2 0 791

2 2029 0 0 903

3 1425 0 0 672

> head(meta_tab,3)

Subect_ID Sample_ID GA_Days Age Race Nugent.Score CST Total.Read.Counts

1 N001 33604 19.29 19 1 0 II 4338

2 N001 35062 23.29 19 1 0 II 4610

3 N001 36790 27.71 19 1 0 II 3596

pregnant

1 1

2 1

3 1

> colnames(meta_tab)

[1] "Subect_ID" "Sample_ID" "GA_Days"

[4] "Age" "Race" "Nugent.Score"

[7] "CST" "Total.Read.Counts" "pregnant"

> dim(meta_tab)

[1] 900 9

> dim(meta_tab)

[1] 900 9

Step 2: Filter data and obtain the subject level taxa relative abundance over time.

Originally, there are 900 samples, 143 taxa. We remove 3 NAs in GA_Days and filter prevalent taxa with relative abundance >0.1% in ≥20% of all samples, thus resulting in 143 taxa in the 897 samples. Among these 897 samples, 139 were from pregnant women and 758 were from nonpregnant women.

> # Remove 3 NAs in GA_Days

> meta_tab = meta_tab[complete.cases(meta_tab[,3]), ]

> colnames(meta_tab)

[1] "Subect_ID" "Sample_ID" "GA_Days"

[4] "Age" "Race" "Nugent.Score"

[7] "CST" "Total.Read.Counts" "pregnant"

> dim(meta_tab)

[1] 897 9

> table(meta_tab$pregnant)

0 1

758 139

Either counts or relative abundance data can be used in the otu_table. Here we use relative abundance data and illustrate two approaches to convert absolute read counts into relative abundance values. In the following commands, we use the decostand () function from the vegan package to convert read counts into relative abundances.

> library(vegan)

> otu_tab_rel <- decostand(otu_tab, method = "total")

> head(otu_tab_rel,3)

Lactobacillus.iners Lactobacillus.crispatus Atopobium.vaginae Lactobacillus

1 0.2420 0.000461 0 0.1823

2 0.4401 0.000000 0 0.1959

3 0.3963 0.000000 0 0.1869

We can confirm whether or not this conversion is successful.

> head(apply(otu_tab_rel, 1, sum),3)

1 2 3

1 1 1

We can also use the apply() function to convert read counts into relative abundances below:

> otu_tab_rel1<-apply(otu_tab,1,function(X){X/sum(X)})

> head(otu_tab_rel1,3)

1 2 3 4 5 6

Lactobacillus.iners 0.242047 0.4401 0.3963 0.4054 0.327488 0.5566161

7 8 9 10 11 12

Lactobacillus.iners 0.3354928 0.4877345 0.06685 0.003204 0.002233 0.446091

13 14 15 16 17

Lactobacillus.iners 0.185113 0.5519315 0.9528253 0.9089599 0.9395604

We can confirm whether or not this conversion is successful.

> head(apply(otu_tab_rel1, 2, sum),3)

1 2 3

1 1 1

By default, the function NMIT() in NMIT package uses the OTUscreen() function to keep the major taxa with a predefined error rate and percentage threshold. The syntax of this function is

OTUscreen (ana, error_rate = 0.1, pct_threshold = 20).

where the argument:

ana is a matrix of OTU table or a phyloseq object with counts/relative abundance OTU table depending on using NMIT() or NMIT_phyloseq().
error_rate is error rate percentage; the default is 0.1%.
pct_threshold is the occurrence percentage threshold; the default percentage threshold is 20%.

By implementing the function NMIT(), a matrix with major taxa has been used in the analysis. We can check subsetting in the modeling results with the information of coef.sites or model.matrix.

Step 3: Compute the subject level taxa temporal correlation.

In this step, for each subject, a temporal correlation between any pair of taxa are computed. This step will return a list of temporal correlation matrix for each subject. The OTU temporal interdependence for each subjects is obtained by running the function tscor(). The syntax of tscor() is given:

tscor(ana, method='kendall', subject_var, fill.na = 0)

where the argument:

ana is a matrix of OTU table or a phyloseq object of counts/relative abundance data.
method is used to specify an option of the correlation method (“pearson”, “kendall”, “spearman”). In NMIT, three correlation methods are available: Pearson, Spearman, and Kendall. The default method is “spearman.” The argument subject_var is a numeric vector of subject. The argument fill.na is a number between 0 and 1 to fill the missing value. The default value is 0.

Step 4: Calculate the distance between temporal correlation matrices.

This step is implemented in the NMIT() function. In this step, the distance between any two correlation matrices in Step 3 is to be calculated based on the specified dist.type (the distance measure). This must be one of “Euclidean,” “maximum,” “manhattan,” “Canberra,” “binary,” or “minkowski.” Frobenius norm is the default type.

Step 5: Perform permutation MANOVA to test the differences of distances between subjects obtained in Step 4.

The permutation MANOVA is implemented via the adonis() in vegan package, which is called in the NMIT() function. In the NMIT() function, a heatmap function is also developed through calling the gplots::heatmap.2() to show the within-group distances and the between-group distances. We can choose to draw heatmap by specifying the value is “TRUE” or “FALSE”. The default value is TRUE.

> # Fit method = "spearman"

> set.seed(123)

> fit_s<-NMIT(

+ otu=otu_tab_rel,

+ id.var=meta_tab$Subect_ID,

+ cov.var= meta_tab$pregnant,

+ time.var= meta_tab$GA_Days,

+ method = "spearman",

+ dist.type = "F",

+ heatmap = F,

+ classify = F,

+ fill.na = 0

+ )

We can retrieve the ANOVA table below:

> fit_s$aov.tab

Permutation: free

Number of permutations: 999

Terms added sequentially (first to last)

Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)

cov.var 1 711 711 1.6 0.03 0.007 **

Residuals 52 23151 445 0.97

Total 53 23862 1.00

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Other results can be obtained such as using the following commands to retrieve coef.sites, f.perms, and model.matrix:

> fit_s$coef.sites

> fit_s$f.perms

> fit_s$model.matrix

> # Fit method = "kendall"

> set.seed(123)

> fit_k<-NMIT(

+ otu=otu_tab,

+ id.var=meta_tab$Subect_ID,

+ cov.var= meta_tab$pregnant,

+ time.var= meta_tab$GA_Days,

+ method = "kendall",

+ dist.type = "F",

+ heatmap = F,

+ classify = F,

+ fill.na = 0

+ )

> fit_k$aov.tab

Permutation: free

Number of permutations: 999

Terms added sequentially (first to last)

Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)

cov.var 1 583 583 1.53 0.029 0.007 **

Residuals 52 19806 381 0.971

Total 53 20389 1.000

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> # Fit method = "pearson"

> set.seed(123)

> fit_p<-NMIT(

+ otu=otu_tab_rel,

+ id.var=meta_tab$Subect_ID,

+ cov.var= meta_tab$pregnant,

+ time.var= meta_tab$GA_Days,

+ method = "pearson",

+ dist.type = "F",

+ heatmap = F,

+ classify = F,

+ fill.na = 0

+ )

> fit_p$aov.tab

Permutation: free

Number of permutations: 999

Terms added sequentially (first to last)

Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)

cov.var 1 700 700 1.46 0.027 0.005 **

Residuals 52 24986 481 0.973

Total 53 25686 1.000

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The above permutation MANOVA test showed that the pregnancy can explain the distances between subjects with P-values of 0.007, 0.007, and 0.005 using Spearman, Kendall, and Pearson correlation methods, respectively. Thus, we can conclude that microbial interdependence profiles over the studied period are statistically different between the pregnant women and those nonpregnant women.

18.2.3 Implement NMIT Using QIIME 2

Example 18.2: Early Childhood Antibiotics and the Microbiome (ECAM)

ECAM (Bokulich et al. 2016) is a longitudinal human study. This study monitored the microbiome profiles of 43 children since their birth over the first 2 years. During the first year of life, stool samples were collected at each visit from these infants. Of the 43 children, 24 were vaginally delivered and 19 were cesarean-delivered. Thirty-one of infants were dominantly (>50% of feedings) breast-fed, and 12 of them were dominantly formula-fed for the first 3 months of life.

The data of this study have been used for testing the NMIT method in the original NMIT paper (Zhang et al. 2017) and QIIME 2 documentation. Both tested whether the microbial interdependence is interrupted during babies’ first year of life between two delivery modes (vaginally and cesarean-delivered) with either adjusting their gender, diet type (breast feeding or on formula), and other covariates (in the original NMIT paper) or without adjusting covariates (in QIIME 2 documentation). Here, we use the same dataset to illustrate NMIT via QIIME 2 to evaluate whether the microbial interdependence is different between breast-dominant children and formula-dominant children during their first-year life. Each child had collected multiple stool samples during this period. In total, 505 stool samples had been collected; at average each child had collected 12 stool samples and minimally had 7 samples. The abundance data had 498 different taxa at the genus level. The 23 major taxa with their relative abundance >0.1% in >15% of the total samples were used for original NMIT paper and QIIME 2 analyses.

In this section, we implement NMIT via QIIME 2 using the data from this ECAM study. The datasets were downloaded from the QIIME 2 website, where the data were used to run NMIT to test delivery modes using the Pearson’s method to calculate the temporal correlation matrix for each sample. Here, we use the data to illustrate the NMIT method via QIIME 2 to test diet types. We use the Kendall’s method to calculate the temporal correlation matrix and the Frobenius norm to evaluate the distance between any two correlation matrices.

NMIT calculates the interdependence correlation matrix based on the repeated microbiome measurements within each subject. Thus, the different time points across subjects or missing observations for some subjects do not affect the value of interdependence correlation matrix (Zhang et al. 2017). Therefore, NMIT does not require that samples are collected at identical time points and hence is robust to missing samples. However, based on QIIME 2 documentation, this may have negative effects on data quality if subjects are highly under-sampled due to missing values or may not provide biologically meaningful results if subjects’ sampling times do not overlap. QIIME 2 also suggests at least 5–6 samples at each time point per subject should be used. For a very large feature table, NMIT may take a long time to run, thus to improve runtime, in practice, the low-abundance features will be filtered and the feature tables on taxonomy will be collapsed such as into to genus level. Here, the illustrated feature tables on taxonomy are genus-level taxa with a relative abundance >0.1% in more than 15% of the total samples.

First, we need to activate QIIME 2. Open the terminal and type: source activate qiime2-2022.2, which is based on which version of QIIME 2 on your computer. When the terminal appears “(qiime2-2022.2),” it indicates qiime2-2022.2 version has been activated. We can work on this version now.

We already create a folder NMIT in the path:QIIME2-Biostatistics/longitudinal/NMIT, so we can use cd QIIME2R-Biostatistics/longitudinal/NMIT to this folder. If this folder did not exist in your computer, you can create one: mkdir QIIME2R-Biostatistics/longitudinal/NMIT.

Next, we can take four main steps to implement NMIT via QIIME 2’s qiime longitudinal nmit command to evaluate longitudinal sample similarity as a function of temporal microbial composition. We illustrate NMIT using Kendall correlation method, the reader can easily use Pearson and Spearman methods just replacing “kendall” in the p-corr-method parameter column with either “pearson” or “spearman.”

Step 1: Perform NMIT using qiime longitudinal nmit command.

source activate qiime2-2022.2

cd QIIME2R-Biostatistics/longitudinal/NMIT

qiime longitudinal nmit\

--i-table FeatureRelativeTableEcam.qza\

--m-metadata-file SampleMetadataEcam.tsv\

--p-individual-id-column studyid\

--p-corr-method kendall\

--p-dist-method fro\

--o-distance-matrix NmitDMKendallEcam.qza

where (1) the required input table is a FeatureTable[RelativeFrequency] artifact (here, FeatureRelativeTableEcam.qza). (2) The sample metadata file containing individual-id-column (here, SampleMetadataEcam.tsv) is also required, which is specified by the m-metadata-file. (3) The p-individual-id-column parameter is also required, which is used to specify the metadata column containing study IDs for individual subjects (here, studyid). (4) The p-corr-method parameter is used to specify the temporal correlation method to be applied. We can choose one of them from: “kendall”(default), “pearson,” “spearman.” This column is required too. (5) The p-dist-method parameter is optional. It is used to specify the temporal distance method either “fro” (Frobenius norm) or “nuc” (nuclear norm) with the default is the Frobenius norm. (6) Finally, we need to name and save the resulting distance matrix (i.e., NmitDMKendallEcam.qza) via the o-distance-matrix parameter, which is an artifact.

The saved output of the distance matrix artifact can be passed to other QIIME 2’s commands for statistical testing of significance and visualization.

Step 2: Conduct statistical significance testing of distance matrix via qiime diversity beta-group-significance command.

qiime diversity beta-group-significance \

--i-distance-matrix NmitDMKendallEcam.qza \

--m-metadata-file SampleMetadataEcam.tsv \

--m-metadata-column diet \

--o-visualization NmitDietKendallEcam.qzv

This step actually performs PERMANOVA to evaluate whether between-group distances are larger than within-group distance. As we stated in Chap. 11 (Sect. 11.3.2), it calls the function adonis() in vegan package. By QIIME2 view, we can see the distances within the dominantly breast-fed and dominantly formula-fed groups are smaller than the distance between these two groups. The group difference related to the microbial interdependence is statistically significant with P-value of 0.045 based on Kendall method. Thus, we concluded that the diet type did alter the microbial interdependence in early childhood.

Step 3: Compute principal coordinates via qiime diversity pcoa command.

qiime diversity pcoa \

--i-distance-matrix NmitDMKendallEcam.qza \

--o-pcoa NmitPCKendallEcam.qza

Step 4: Use Emperor to visualize similarities among subjects via qiime emperor plot command.

Finally, we can also use Emperor plot to visualize similarities among subjects (Fig. 18.1).

# Figure 18.1

qiime emperor plot \

--i-pcoa NmitPCKendallEcam.qza \

--m-metadata-file SampleMetadataEcam.tsv \

--o-visualization NmitEmperorKendallEcam.qzv

Fig. 18.1
Principal coordinates generated by PCoA and the similarities among subjects plotted by Emperor plot

Please note that the computed principal coordinates and similarities visualized by Emperor plot are among subjects but not among individual samples.

18.2.4 Remarks on NMIT

In the literature, microbiome data have been considered having the unique compositional structure. To address the spurious correlation problem with compositional data (Pearson 1896; Ranganathan and Borges 2011), a family of log-ratio transformations has been proposed (Xia et al. 2018a), which consists of the additive log-ratio (alr) transformation (Aitchison 1986) (p.113), the centered log-ratio (clr) transformation (Aitchison 2003), and the isometric log-ratio (ilr) transformation (Egozcue et al. 2003) (see details on Chap. 10 in this book).

The clr-representation of composition x = (x₁, …, x_i, …, x_D) is defined as the logarithm of the components after dividing by the geometric mean of x (a vector of relative abundance for taxa), which is most often implemented to remove the possible compositional effect:

$clr(x)=\left[\ln \left(\frac{x_1}{g_m(x)}\right),\cdots \ln \left(\frac{x_i}{g_m(x)}\right),\cdots \ln \left(\frac{x_D}{g_m(x)}\right)\right],$

(18.3)

with g_m(x) = $\sqrt[D]{x_1\cdot {x}_2\cdots {x}_D}$ ensuring that the sum of the elements of clr(x) is zero. The clr transforms the composition to the Euclidean sample space, and hence providing the possibility of using standard unconstrained statistical methods for analyzing compositional data (Aitchison 2003).

NMIT does not use the clr transformation in its test based on the following two reasons: (1) The correlation is calculated between any pair of taxa using the microbiome measurements at the multiple time points within each subject. The original correlation will be contaminated using the transformation because the geometric mean function varies across different time points and the transformation is performed at each time point independently. (2) The compositional effect (e.g., the spurious correlation) of the microbiome data is mild (Friedman and Alm 2012) compared to the ecology data because the number of taxa in the microbiome studies is usually larger than that in the ecology studies.

NMIT is a statistical method for discovering the microbiome data structure. This test is one of the first steps to understand the dynamic nature of microbiome data. However, the NMIT methods have the drawbacks, for example:

As a distance-based testing method, NMIT provides an overall assessment of the group difference in terms of the interdependent relationship (microbial interdependence similarity) among taxa (Zhang et al. 2017). Further steps are required to identify the specific key taxa that are associated with outcomes or biological covariates.
NMIT mainly uses the correlation tests (Pearson correlation, Kendall’s rank correlation, and maximal information coefficient (MIC)). However, using correlation coefficients to detect dependencies of microbial taxa cannot avoid detecting spurious correlations due to compositionality (Chen and Li 2016) and could be severely underpowered owing to the relatively low number of samples (Layeghifard et al. 2018).
The study demonstrated Kendall method is always comparable or has a slight power edge over Pearson method, while MIC method has less power than Pearson and Kendall methods (Zhang et al. 2017). These findings are not all consistent with the literature. Thus, these arguments need to be further validated.
Current version of NMIT cannot handle time-varying covariate (Zhang et al. 2017).

Recently NMIT and linear mixed-effects models (LMMs) (Lindstrom and Bates 1988) were built in QIIME 2 as the two q2-longitudinal plugins to facilitate streamlined analysis and visualization of longitudinal and paired sample data sets (Bokulich et al. 2018). When the reader performs statistical analysis through QIIME 2, the user needs to know the advantages and disadvantages of the underlying methods.

18.3 The Large P Small N Problem

As we mentioned in Sect. 18.1, with many microbial taxa presented in a limited sample size, multivariate analysis of microbiome data can result in the large P small N problem (Xia and Sun 2022). The “large P small N” problem exist in both univariate and multivariate analyses.

In univariate analysis, the “large P small N” problem refers to the statistical modeling and inference problem in analyzing a response variable Y, with the larger number of covariates P in a set of vector X = (X₁, X₂, …, X_P) and the substantially smaller number of available sample size N (i.e., N ≪ P) (Chakraborty et al. 2012). Here, P usually refers to the dimension of predictors (independent variables), and N refers to the sample size. That is, the design matrix (matrix of predictors) of N × P (Bernardo et al. 2003) is high-dimensional, so that it is challenging to directly estimate the parameters. The problem is more deteriorated if multicollinearity exists in the predictors.

In multivariate analysis, the “large P small N” problem means that there are more features than data points, where P refers to the dimension of response (dependent) variables. In terms of data matrix, P refers to the number of columns (graphically dimensional space), and N refers to the number of rows. The “large P small N” problem refers to small data points (samples: N), but each data point contains large P features (dimensional space).

Such situations usually arise in multiple-omics data. For example, in genome-wide association studies (GWAS), the “large P small N” problem could occur (Diao and Vidyashankar 2013) because the number of observations N is usually less hundreds or thousands, whereas the number of markers P is approximately hundreds of thousands (Mei and Wang 2016). One main characteristic of metabolomics datasets is that the number of metabolites (P) is greater than the number of observations (N) (Johnstone and Titterington 2009). Microbiome sequence data sets are also high dimensional with the number of features (ASV/OTU or taxa) being much higher than the number of samples (Xia et al. 2018b; Xia 2020; Xia and Sun 2022).

In statistical modeling and inference, a larger P needs a larger N. When the large P with small N, then it results in the problem of “curse of dimensionality.” The problem of “large P small N” makes traditional linear regression methods infeasible because design matrix is singular, is underdetermined, and hence is no longer invertible, and no unique least-squares solution exists as well. When the multiple responses are correlated, the “large P small N” problem is more complicated.

In analysis of high dimensional omics data, the general strategy is either to reduce the number of variables to keep only significant variables in models or to project variables to lower dimension. Ideally, the multivariate analysis methods used are capable of dealing with significant amounts of collinearity in data matrix. Thus, some kinds of special statistical methods, such as variable selection, principal component related methods (of which PCA and PLS are prime examples), penalized likelihood, constraint or shrinkage methods, Bayesian methods, and ridge regressions are often used.

18.4 Summary

In the last chapter, we first provided an overview of multivariate longitudinal microbiome methods, including multivariate distance/kernel-based longitudinal models and multivariate integration of multi-omics methods, and briefly discussed the approaches of univariate analysis and multivariate analysis. We then specifically introduced the NMIT and illustrated its implements using R and QIIME 2 with real microbiome data. Finally, we discussed the large P small N problem in general statistics and especially in multiple-omics studies.

References

Aitchison, J. 1986. The statistical analysis of compositional data. Monographs on statistics and applied probability. London: Chapman & Hall Ltd.. (Reprinted in 2003 with additional material by The Blackburn Press). 416 p.
Aitchison, John. 2003. A concise guide to compositional data analysis. In 2nd compositional data analysis workshop. Girona, Italy, 2003.
Anderson, Marti J. 2001. A new method for non-parametric multivariate analysis of variance. Austral Ecology 26 (1): 32–46. https://doi.org/10.1111/j.1442-9993.2001.01070.pp.x. https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1442-9993.2001.01070.pp.x.Crossref
Bernardo, J.M., M.J. Bayarri, J.O. Berger, A.P. Dawid, D. Heckerman, A. Smith, and M. West. 2003. Bayesian factor regression models in the “large p, small n” paradigm. Bayesian Statistics 7: 733–742.
Bodein, Antoine, Olivier Chapleur, Arnaud Droit, and Kim-Anh Lê Cao. 2019. A generic multivariate framework for the integration of microbiome longitudinal studies with other data types. Frontiers in Genetics 10: 963–963. https://doi.org/10.3389/fgene.2019.00963. https://pubmed.ncbi.nlm.nih.gov/31803221; https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6875829/.CrossrefPubMedPubMedCentral
Bokulich, Nicholas A., Jennifer Chung, Thomas Battaglia, Nora Henderson, Melanie Jay, Huilin Li, Arnon D. Lieber, Fen Wu, Guillermo I. Perez-Perez, and Yu Chen. 2016. Antibiotics, birth mode, and diet shape microbiome maturation during early life. Science Translational Medicine 8 (343): 343ra82-343ra82.Crossref
Bokulich, Nicholas A., Matthew R. Dillon, Yilong Zhang, Jai Ram Rideout, Evan Bolyen, Huilin Li, Paul S. Albert, and J. Gregory Caporaso. 2018. q2-longitudinal: Longitudinal and paired-sample analyses of microbiome data. mSystems 3 (6): e00219–18. https://doi.org/10.1128/mSystems.00219-18. https://msystems.asm.org/content/msys/3/6/e00219-18.full.pdf.CrossrefPubMedPubMedCentral
Bray, J. Roger, and J.T. Curtis. 1957. An ordination of the upland forest communities of Southern Wisconsin. Ecological Monographs 27 (4): 325–349. https://doi.org/10.2307/1942268. https://esajournals.onlinelibrary.wiley.com/doi/abs/10.2307/1942268.
Breslow, N.E., and D.G. Clayton. 1993. Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 88 (421): 9–25. https://doi.org/10.2307/2290687. http://www.jstor.org/stable/2290687.Crossref
Chakraborty, Sounak, Malay Ghosh, and Bani K. Mallick. 2012. Bayesian nonlinear regression for large p small n problems. Journal of Multivariate Analysis 108: 28–40. https://doi.org/10.1016/j.jmva.2012.01.015. http://www.sciencedirect.com/science/article/pii/S0047259X12000164.Crossref
Chen, Eric Z., and Hongzhe Li. 2016. A two-part mixed-effects model for analyzing longitudinal microbiome compositional data. Bioinformatics 32 (17): 2611–2617. https://doi.org/10.1093/bioinformatics/btw308.CrossrefPubMedPubMedCentral
Cristianini, Nello, and John Shawe-Taylor. 2000. An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press.Crossref
Diao, Guoqing, and Anand N. Vidyashankar. 2013. Assessing genome-wide statistical significance for large p small n problems. Genetics 194 (3): 781–783. https://doi.org/10.1534/genetics.113.150896. https://pubmed.ncbi.nlm.nih.gov/23666935; https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3697980/.CrossrefPubMedPubMedCentral
Egozcue, Juan José, Vera Pawlowsky-Glahn, Glòria Mateu-Figueras, and Carles Barcelo-Vidal. 2003. Isometric logratio transformations for compositional data analysis. Mathematical Geology 35 (3): 279–300.Crossref
Friedman, Jonathan, and Eric J. Alm. 2012. Inferring correlation networks from genomic survey data. PLoS Computational Biology 8 (9): e1002687.CrossrefPubMedPubMedCentral
Jaccard, Paul. 1912. The distribution of the flora in the alpine zone.1. New Phytologist 11 (2): 37–50. https://doi.org/10.1111/j.1469-8137.1912.tb05611.x. https://nph.onlinelibrary.wiley.com/doi/abs/10.1111/j.1469-8137.1912.tb05611.x.
Johnstone, Iain M., and D. Michael Titterington. 2009. Statistical challenges of high-dimensional data. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 367 (1906): 4237–4253. https://doi.org/10.1098/rsta.2009.0159. https://royalsocietypublishing.org/doi/abs/10.1098/rsta.2009.0159.Crossref
Koh, Hyunwook, Yutong Li, Xiang Zhan, Jun Chen, and Ni Zhao. 2019. A distance-based kernel association test based on the generalized linear mixed model for correlated microbiome studies. Frontiers in Genetics 10 (458). https://doi.org/10.3389/fgene.2019.00458. https://www.frontiersin.org/article/10.3389/fgene.2019.00458.
Layeghifard, Mehdi, David M. Hwang, and David S. Guttman. 2018. Constructing and analyzing microbiome networks in R. In Microbiome analysis, ed. Robert G. Beiko et al., 243–266. Springer.Crossref
Lin, Xihong. 1997. Variance component testing in generalised linear models with random effects. Biometrika 84 (2): 309–326. https://doi.org/10.1093/biomet/84.2.309.Crossref
Lindstrom, Mary J., and Douglas M. Bates. 1988. Newton-Raphson and EM algorithms for linear mixed-effects models for repeated-measures datam. Journal of the American Statistical Association 83 (404): 1014–1022. https://doi.org/10.2307/2290128. https://www.jstor.org/stable/2290128.Crossref
McArdle, Brian H., and Marti J. Anderson. 2001. Fitting multivariate models to community data: A comment on distance-based redundancy analysis. Ecology 82 (1): 290–297. https://doi.org/10.1890/0012-9658(2001)082[0290:Fmmtcd]2.0.Co;2. https://esajournals.onlinelibrary.wiley.com/doi/abs/10.1890/0012-9658%282001%29082%5B0290%3AFMMTCD%5D2.0.CO%3B2.Crossref
Mei, Bujun, and Zhihua Wang. 2016. An efficient method to handle the ‘large p, small n’problem for genomewide association studies using Haseman–Elston regression. Journal of Genetics 95 (4): 847–852.CrossrefPubMed
Pearson, Karl. 1896. Mathematical contributions to the theory of evolution – On a form of spurious correlation which may arise when indices are used in the measurement of organs. Proceedings of the Royal Society of London 60 (359–367): 489–498.
Ranganathan, Yuvaraj, and Renee M. Borges. 2011. To transform or not to transform: That is the dilemma in the statistical analysis of plant volatiles. Plant Signaling & Behavior 6 (1): 113–116.Crossref
Reiss, Philip T., M. Henry, H. Stevens, Zarrar Shehzad, Eva Petkova, and Michael P. Milham. 2010. On distance-based permutation tests for between-group comparisons. Biometrics 66 (2): 636–643.CrossrefPubMed
Reshef, David N., Yakir A. Reshef, Hilary K. Finucane, Sharon R. Grossman, Gilean McVean, Peter J. Turnbaugh, Eric S. Lander, Michael Mitzenmacher, and Pardis C. Sabeti. 2011. Detecting novel associations in large data sets. Science 334 (6062): 1518–1524.CrossrefPubMedPubMedCentral
Romero, Roberto, Sonia S. Hassan, Pawel Gajer, Adi L. Tarca, Douglas W. Fadrosh, Janine Bieda, Piya Chaemsaithong, Jezid Miranda, Tinnakorn Chaiworapongsa, and Jacques Ravel. 2014. The vaginal microbiota of pregnant women who subsequently have spontaneous preterm labor and delivery and those with a normal delivery at term. Microbiome 2 (1): 18. https://doi.org/10.1186/2049-2618-2-18.CrossrefPubMedPubMedCentral
Tang, Zheng-Zheng, Guanhua Chen, and Alexander V. Alekseyenko. 2016. PERMANOVA-S: Association test for microbial community composition that accommodates confounders and multiple distances. Bioinformatics 32 (17): 2618–2625. https://doi.org/10.1093/bioinformatics/btw311.CrossrefPubMedPubMedCentral
Xia, Yinglin. 2020. Correlation and association analyses in microbiome study integrating multiomics in health and disease. Progress in Molecular Biology and Translational Science 171: 309–491.CrossrefPubMed
Xia, Yinglin, and Jun Sun. 2022. Statistical data analysis of microbiomes and metabolomics. Washington, DC: American Chemical Society.Crossref
Xia, Yinglin, Jun Sun, and Ding-Geng Chen. 2018a. Compositional analysis of microbiome data. In Statistical analysis of microbiome data with R, ed. Yinglin Xia et al., 331–393. Springer.Crossref
———. 2018b. What are microbiome data? In Statistical analysis of microbiome data with R, ed. Yinglin Xia et al., 29–41. Springer.Crossref
Zhan, Xiang, Lingzhou Xue, Haotian Zheng, Anna Plantinga, Michael C. Wu, Daniel J. Schaid, Ni Zhao, and Jun Chen. 2018. A small-sample kernel association test for correlated data with application to microbiome association studies. Genetic Epidemiology 42 (8): 772–782. https://doi.org/10.1002/gepi.22160. https://onlinelibrary.wiley.com/doi/abs/10.1002/gepi.22160.CrossrefPubMed
Zhang, Yilong, Sung Won Han, Laura M. Cox, and Huilin Li. 2017. A multivariate distance-based analytic framework for microbial interdependence association test in longitudinal study. Genetic Epidemiology 41 (8): 769–778. https://doi.org/10.1002/gepi.22065. https://www.ncbi.nlm.nih.gov/pubmed/28872698; https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5696116/.CrossrefPubMedPubMedCentral